-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restrict Node Label Matching for Supported Topology Keys in NodeTopologyMap #2965
Restrict Node Label Matching for Supported Topology Keys in NodeTopologyMap #2965
Conversation
9f8e66d
to
2bf7271
Compare
api/v1/topologymap.go
Outdated
// GetKeyValues returns a node label matching the topologyKey and all values for StrechCluster | ||
// for that label across all storage nodes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the comment as per the working of this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
api/v1/topologymap.go
Outdated
for label, labelValues := range m.Labels { | ||
if label == corev1.LabelZoneFailureDomainStable || label == labelZoneFailureDomainWithoutBeta { | ||
topologyKey = label | ||
values = labelValues | ||
break | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means for stretch cluster we can't have a different failure domain? Like, rack
, host
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should update the GetKeysValues
function to do an equality check with valid labels that we define in the validTopologyLabelKeys
and return the key, values accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@malayparida2000 @iamniting what do you guys think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with @Nikhil-Ladha
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stretch always uses zones, just to confirm in this thread in addition to comments from others that confirmed the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@OdedViner Can you add some details on what's the problem with the existing code? And what's the exact issue. It's difficult to understand the motive for the change without knowing the root cause you are trying to fix.
api/v1/topologymap.go
Outdated
"strings" | ||
) | ||
|
||
const labelZoneFailureDomainWithoutBeta = "failure-domain.kubernetes.io/zone" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wasn't this already defined somewhere in the codebase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable is defined in another package within the project, but it is a private variable (starts with a lowercase letter). Since it cannot be accessed outside its package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw our docs & went as far back as OCS 4.7. And I can confirm 2 things.
- Stretch cluster always uses failure domain zone
- For zone failure domain always
topology.kubernetes.io/zone
label is used. I don't see any need to search for any other labels such asfailure-domain.kubernetes.io/zone
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed with Malay on all these points. failure-domain.kubernetes.io/zone
is an old deprecated label for topology that was removed from K8s many releases ago, we should not use it anymore.
Root Cause: Impact: Proposed Solution: To address this issue, I introduced a new function, GetKeyValuesStrechCluster, which specifically checks for relevant and valid topology keys (corev1.LabelZoneFailureDomainStable and labelZoneFailureDomainWithoutBeta). This ensures only the intended labels are considered, preventing incorrect matches and maintaining the integrity of the stretch cluster configuration. Why This Fix Is Necessary: @malayparida2000 @Nikhil-Ladha @iamniting @sp98 |
|
Yes. It's always zone. |
In the whole of ODF we have only 3 supported kind of failure domains rack, hostname, and zone. |
99f9f7d
to
0796ed4
Compare
@OdedViner This fix selectively addresses the case of stretch cluster only, while the mis match of labels can still happen to a normal cluster if the customer uses an improperly formatted label. I would suggest fixing the issue properly for all cases. |
0796ed4
to
8f85424
Compare
cb85115
to
789e697
Compare
Restrict topology key matching to explicitly supported labels to prevent incorrect updates and ensure reliable cluster configuration. Signed-off-by: Oded Viner <[email protected]>
789e697
to
d60ecd7
Compare
The fix looks good now, @OdedViner Can you please test this with a cluster & let us know the results here. |
The private image seems to have resolved the issue Procedure: 2.Add "topology.kubernetes.io/zone-principal=true" label to all worker nodes
4.Check Storagecluster status:
Failure Domain Key: topology.kubernetes.io/zone-principal
5.Create private image based relase-4.18 branch
image location: 7.Change csv image
10.Check storagecluster status
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: iamniting, malayparida2000, OdedViner The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
1085e40
into
red-hat-storage:main
/cherry-pick release-4.18 |
@malayparida2000: new pull request created: #2993 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
The OCS operator used strings. Contains to match topology keys, leading to incorrect matches (e.g.,
topology.kubernetes.io/zone-principal
instead oftopology.kubernetes.io/zone
). This caused invalid updates to nodeTopologies in the StorageCluster spec, breaking stretch cluster configuration and resulting in Rook operator failure.Fix:
Updated the
GetKeyValues
method to explicitly match supported topology keys (rack
,hostname
,zone
) using a predefined map. Unsupported or unmatched keys now return empty results.Impact:
StorageCluster
andCephCluster
configurations.